Move BeamSpot transfer to GPU to its own producer #318
… cudaHostAllocWriteCombined
3ac4c7c
to
0135dbf
Rebased on top of 10_6_X_Patatrack (HEAD corresponding to #315), and fixed the compilation errors (fixes squashed to the original commits). |
Validation summary: reference release CMSSW_10_6_0_pre2 at 1313262
|
No impact on physics. |
Looks like GitHub is smart enough, no need to rebase this. |
I'm amazed. |
No impact on timing, measured on a T4 over TTbar MC. Before:
After:
|
@makortel can you remind me of the use case for the non-cached pinned host memory? |
@fwyzard Your suggestion in #245 (comment) |
Ah, I see, it is a different set of functions because these do not rely on any CUDA stream, correct? |
Right, no CUDA stream and no caching by our allocator. The difference wrt. the api wrappers |
We seem to have a proliferation of memory allocation functions ... |
Yeah, I'm not too happy about that either. A challenge for supporting flags in the caching allocator is that, AFAICT, the flags create an additional dimension for the binning (in addition to the device index and the allocation size). |
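To illustrate the binning concern above, here is a minimal sketch of what a flag-aware bin key could look like. It is not the actual caching allocator: `BinKey`, `sizeBin` and `FreeLists` are hypothetical names, and power-of-two size bins are an assumption made only for this example.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical bin key: a cached block could only be reused when the
// device, the size bin and the allocation flags all match.
struct BinKey {
  int device;          // device index
  unsigned int bin;    // size bin (assumed here: power-of-two exponent)
  unsigned int flags;  // allocation flags, e.g. the value given to cudaHostAlloc()

  bool operator<(BinKey const& other) const {
    return std::tie(device, bin, flags) < std::tie(other.device, other.bin, other.flags);
  }
};

// Smallest exponent n such that (1 << n) >= bytes.
inline unsigned int sizeBin(std::size_t bytes) {
  unsigned int n = 0;
  while ((std::size_t{1} << n) < bytes) {
    ++n;
  }
  return n;
}

// Free blocks grouped by the three-dimensional key: every distinct flag
// value gets its own set of free lists for each device and size bin.
using FreeLists = std::map<BinKey, std::vector<void*>>;
```

With such a key, two requests of the same size on the same device but with different flags end up in separate free lists, which is the extra dimension mentioned above.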
There is a problem with a multi-threaded job on data.
MC is OK; the problem shows up on data in multi-threaded mode.
The full log is in ~/data/beamspotproblem.log. It is not necessarily fully reproducible; I can try to run with valgrind (ahhhh).
|
btw: I had to add
in cuda_assert.h and make sure it is the last include in each file.... |
The symptoms seem to be cured by
BUT it is still crashing in multi-job mode
|
I added the printout of #318 (comment), but so far I have been unable to reproduce on
Can you try with |
This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed downstream. Would just removing the stream_->is_clear() from the if be sufficient? |
On 22 Apr, 2019, at 5:23 PM, Matti Kortelainen ***@***.***> wrote:
This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed by downstream. Would just removing the stream_->is_clear() from the if be sufficient?
Apparently yes, as far as the meaningless beamspot is concerned.
It is still crashing in multijob mode (usually at the very beginning, as in ~/data/crashMultiJobAfterFixInContext_3.log).
v.
|
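Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.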
The multijob crash must be something different as it was first reported 3.5 weeks ago in #306. |
On 22 Apr, 2019, at 6:18 PM, Matti Kortelainen ***@***.***> wrote:
Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.
ok, let me know as I used the same pattern for TrackingRecHit in #322
|
The pattern itself is fine (and I want to keep it), so I'll make a fix under the hood. |
Implement a non-caching host allocator, useful for host-to-device copy buffers:
- not bound to any CUDA stream to allow use in EDM beginStream();
- with the possibility to pass flags to cudaHostAlloc(), e.g. cudaHostAllocWriteCombined.
Add perfect forwarding overload for CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().
Move the BeamSpot host-to-device transfer to its own EDProducer, making use of beginStream()-allocated write-combined memory for the transfer.
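The non-caching host allocator described in this commit message can be sketched roughly as follows. This is not the code of the PR: the helper and type names are hypothetical, and the allocation calls are replaced by host-only stand-ins so that the sketch builds without the CUDA toolkit. In real code the stand-ins would be cudaHostAlloc(&ptr, bytes, flags) and cudaFreeHost(ptr).

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

// Stand-ins (assumption): replace with cudaHostAlloc() / cudaFreeHost()
// when building against the CUDA runtime. The flags are ignored here.
inline int standInHostAlloc(void** ptr, std::size_t bytes, unsigned int /*flags*/) {
  *ptr = std::malloc(bytes);
  return *ptr != nullptr ? 0 : 1;
}
inline void standInFreeHost(void* ptr) { std::free(ptr); }

// The deleter frees the memory directly: no CUDA stream and no cache is
// involved, so the pointer can be created e.g. once per EDM stream.
struct HostNoncachedDeleter {
  void operator()(void* ptr) const { standInFreeHost(ptr); }
};

template <typename T>
using HostNoncachedUniquePtr = std::unique_ptr<T, HostNoncachedDeleter>;

// Hypothetical helper; the flags are passed through to the allocation call.
template <typename T>
HostNoncachedUniquePtr<T> makeHostNoncachedUnique(unsigned int flags = 0) {
  static_assert(std::is_trivially_default_constructible<T>::value &&
                    std::is_trivially_destructible<T>::value,
                "the deleter only frees memory, it does not run destructors");
  void* mem = nullptr;
  if (standInHostAlloc(&mem, sizeof(T), flags) != 0) {
    throw std::bad_alloc();
  }
  return HostNoncachedUniquePtr<T>(new (mem) T);
}
```

A producer could then hold such a pointer as a data member, fill it in beginStream() with the desired flags, and reuse the same buffer for every event.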
PR description:
This PR is a follow-up to #245 and makes a first attempt to transfer the BeamSpot data to the GPU in its own producer instead of in the rechit producer. I left the covariance matrix for subsequent work, as it is not strictly needed at the moment and Eigen apparently does not support minimal storage for symmetric matrices.
In addition, a perfect forwarding overload is added for the CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().
As in #245, a mechanism is added to create non-cached pinned host memory unique_ptrs with the possibility to pass custom flags to cudaHostAlloc(). There is one commit for moving the BeamSpot transfer to use a once-per-stream-per-job allocated write-combined buffer, and another commit for doing the same for the raw data. A difference wrt. #245 is that the empty GPU::SimpleVector<PixelErrorCompact> is still transferred via pinned host memory from the caching allocator (with the current SiPixelDigiErrorsCUDA, providing the transfer buffer outside of the class would look ugly, but could be done if really wanted).
PR validation:
Tested that a profile configuration runs, and with nvprof that the BeamSpot transfer can occur in parallel to e.g. clustering kernels.
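For readers unfamiliar with the perfect forwarding overload mentioned in the PR description, the idea can be sketched in plain C++ as below. `Product`, `Context` and `BeamSpotToy` are simplified, hypothetical stand-ins for CUDAProduct, CUDAScopedContext and the BeamSpot CUDA product; the real classes carry device and stream information, and the real emplace() has a different signature.

```cpp
#include <utility>

// Stand-in for the product wrapper: holds the payload plus an integer
// that plays the role of the stream the payload was produced in.
template <typename T>
class Product {
public:
  // Perfect forwarding overload: the payload is constructed in place
  // from whatever arguments the caller provides.
  template <typename... Args>
  explicit Product(int stream, Args&&... args)
      : stream_(stream), data_(std::forward<Args>(args)...) {}

  int stream() const { return stream_; }
  T const& data() const { return data_; }

private:
  int stream_;
  T data_;
};

// Stand-in for the scoped context: emplace() forwards the arguments to
// the wrapper, so the caller never builds a temporary payload.
class Context {
public:
  explicit Context(int stream) : stream_(stream) {}

  template <typename T, typename... Args>
  Product<T> emplace(Args&&... args) const {
    return Product<T>(stream_, std::forward<Args>(args)...);
  }

private:
  int stream_;
};

// Toy payload with a non-trivial constructor. In the pattern discussed in
// the conversation, such a constructor would queue the asynchronous copy.
struct BeamSpotToy {
  BeamSpotToy(float ax, float ay, float az) : x(ax), y(ay), z(az) {}
  float x, y, z;
};
```

In this sketch, `Context ctx(3); auto p = ctx.emplace<BeamSpotToy>(1.f, 2.f, 3.f);` constructs the payload directly inside the wrapper, which is the convenience the overload is meant to provide.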